Many past studies have examined the factors affecting life expectancy using demographic variables, income composition, and mortality rates. However, the effect of immunization and the human development index was not taken into account. In addition, some past research relied on multiple linear regression fitted to a single year of data for all countries. This motivates addressing both gaps by formulating regression models, based on a mixed effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio, and Diphtheria are also considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors, and other health-related factors. Since the observations in this dataset are organized by country, it is easier for a country to determine which predictor is contributing to a lower life expectancy. This helps suggest which area a country should prioritize in order to efficiently improve the life expectancy of its population.
Column Names: Meanings
Prior to running the models, the data was cleaned by removing nulls, correcting inconsistencies, and eliminating columns that were not required.
Removing NaN values: This produced higher R^2 values and fewer variables that were considered significant. While the model appears to fit better, this is not the ideal scenario since ~30% of the data was lost in the process.
Filtering out percentage.expenditure greater than 100: This caused additional variables to be considered insignificant. These rows are filtered because a population cannot spend more than 100% of its GDP on health care.
Replacing NaN with means: This allows all data to be retained.
## [1] 0
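The three cleaning strategies above can be sketched as follows. This is a minimal Python illustration on a hypothetical toy table; the rows and the helper names `drop_missing`, `filter_expenditure`, and `impute_mean` are assumptions made for the example and are not the code used for this analysis.

```python
# Hypothetical toy rows; None marks a missing value (NaN in the source data).
rows = [
    {"Life.expectancy": 65.0, "percentage.expenditure": 71.3, "Alcohol": 0.01},
    {"Life.expectancy": 77.8, "percentage.expenditure": 364.9, "Alcohol": 4.6},
    {"Life.expectancy": 75.6, "percentage.expenditure": None, "Alcohol": None},
    {"Life.expectancy": 52.4, "percentage.expenditure": 0.0, "Alcohol": 8.3},
]

def drop_missing(rows):
    """Strategy 1: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def filter_expenditure(rows, limit=100.0):
    """Strategy 2: drop rows whose percentage.expenditure exceeds the limit."""
    return [r for r in rows
            if r["percentage.expenditure"] is None
            or r["percentage.expenditure"] <= limit]

def impute_mean(rows):
    """Strategy 3: replace each missing value with its column mean."""
    cols = list(rows[0].keys())
    means = {}
    for c in cols:
        present = [r[c] for r in rows if r[c] is not None]
        means[c] = sum(present) / len(present)
    return [{c: (r[c] if r[c] is not None else means[c]) for c in cols}
            for r in rows]

complete = drop_missing(rows)
plausible = filter_expenditure(rows)
imputed = impute_mean(rows)

# Count of values still missing after imputation.
n_missing = sum(v is None for r in imputed for v in r.values())
print(len(complete), len(plausible), len(imputed), n_missing)
```

Dropping incomplete rows shrinks the table, while mean imputation keeps every row and leaves no missing values, which is the trade-off described above.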
After running multiple regression model selection procedures (forward, backward, and stepwise), the following variables were determined to be the statistically significant predictors of life expectancy; stepwise and backward selection gave the same results.
Variable: Coefficient
StatusDeveloping -1.923054
Adult.Mortality -0.016917
infant.deaths 0.086154
under.five.deaths -0.067124
Total.expenditure 0.250440
Diphtheria 0.028511
HIV.AIDS -1.020423
Income.composition.of.resources 27.870308
Adjusted R^2: 0.83
AIC: 992.018
While the other variables may intuitively appear significant, the variables above are the predictors that have a significant impact according to the regression models. Interestingly enough, all models returned the same variables.
##
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Total.expenditure +
## Diphtheria + HIV.AIDS + GDP + thinness.5.9.years + Income.composition.of.resources +
## Life.expectancy.category, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.1994 -1.8642 0.0611 1.6677 8.9195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.170e+01 3.177e+00 19.423 < 2e-16 ***
## StatusDeveloping -2.420e+00 8.076e-01 -2.997 0.00313 **
## Adult.Mortality -1.459e-02 3.633e-03 -4.017 8.77e-05 ***
## Total.expenditure 2.649e-01 9.709e-02 2.729 0.00702 **
## Diphtheria 1.770e-02 1.160e-02 1.526 0.12872
## HIV.AIDS -5.715e-01 2.472e-01 -2.312 0.02197 *
## GDP 2.888e-05 1.706e-05 1.693 0.09222 .
## thinness.5.9.years -1.700e-01 6.955e-02 -2.444 0.01553 *
## Income.composition.of.resources 1.895e+01 3.274e+00 5.787 3.30e-08 ***
## Life.expectancy.categoryLow -5.693e+00 1.008e+00 -5.649 6.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.286 on 173 degrees of freedom
## Multiple R-squared: 0.86, Adjusted R-squared: 0.8527
## F-statistic: 118 on 9 and 173 DF, p-value: < 2.2e-16
## [1] 966.4646
##
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria +
## HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years +
## Income.composition.of.resources + Schooling + Life.expectancy.category,
## data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.1058 -1.6731 0.2159 1.5864 8.8874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.250e+01 3.433e+00 18.206 < 2e-16 ***
## StatusDeveloping -2.137e+00 8.647e-01 -2.472 0.014480 *
## Adult.Mortality -1.494e-02 3.812e-03 -3.919 0.000131 ***
## infant.deaths 3.532e-02 5.627e-02 0.628 0.531077
## Alcohol 1.012e-01 8.060e-02 1.256 0.210921
## percentage.expenditure 5.173e-05 2.357e-04 0.219 0.826571
## Hepatitis.B -1.146e-02 2.112e-02 -0.543 0.588053
## Measles -2.369e-05 4.766e-05 -0.497 0.619749
## BMI -5.531e-03 1.562e-02 -0.354 0.723766
## under.five.deaths -2.839e-02 3.896e-02 -0.729 0.467238
## Polio 2.512e-05 1.945e-02 0.001 0.998971
## Total.expenditure 2.299e-01 1.041e-01 2.209 0.028587 *
## Diphtheria 2.606e-02 2.378e-02 1.096 0.274780
## HIV.AIDS -5.667e-01 2.585e-01 -2.192 0.029778 *
## GDP 2.068e-05 3.627e-05 0.570 0.569358
## Population 2.374e-09 6.398e-09 0.371 0.711052
## thinness..1.19.years 7.806e-03 2.346e-01 0.033 0.973500
## thinness.5.9.years -1.774e-01 2.326e-01 -0.763 0.446740
## Income.composition.of.resources 1.880e+01 5.699e+00 3.299 0.001195 **
## Schooling -3.827e-02 2.313e-01 -0.165 0.868812
## Life.expectancy.categoryLow -5.430e+00 1.118e+00 -4.858 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.351 on 162 degrees of freedom
## Multiple R-squared: 0.8636, Adjusted R-squared: 0.8468
## F-statistic: 51.3 on 20 and 162 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Total.expenditure +
## Diphtheria + HIV.AIDS + GDP + thinness.5.9.years + Income.composition.of.resources +
## Life.expectancy.category, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.1994 -1.8642 0.0611 1.6677 8.9195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.170e+01 3.177e+00 19.423 < 2e-16 ***
## StatusDeveloping -2.420e+00 8.076e-01 -2.997 0.00313 **
## Adult.Mortality -1.459e-02 3.633e-03 -4.017 8.77e-05 ***
## Total.expenditure 2.649e-01 9.709e-02 2.729 0.00702 **
## Diphtheria 1.770e-02 1.160e-02 1.526 0.12872
## HIV.AIDS -5.715e-01 2.472e-01 -2.312 0.02197 *
## GDP 2.888e-05 1.706e-05 1.693 0.09222 .
## thinness.5.9.years -1.700e-01 6.955e-02 -2.444 0.01553 *
## Income.composition.of.resources 1.895e+01 3.274e+00 5.787 3.30e-08 ***
## Life.expectancy.categoryLow -5.693e+00 1.008e+00 -5.649 6.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.286 on 173 degrees of freedom
## Multiple R-squared: 0.86, Adjusted R-squared: 0.8527
## F-statistic: 118 on 9 and 173 DF, p-value: < 2.2e-16
## CV AIC AICc BIC AdjR2
## 11.7114475 447.1331296 448.6769892 482.4374772 0.8526682
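As a consistency check on the summaries above, adjusted R^2 can be recomputed from R^2, the number of observations n, and the number of predictors p as 1 - (1 - R^2)(n - 1)/(n - p - 1). The Python sketch below does this for the reduced and full models, with n = 183 inferred from the residual degrees of freedom plus the number of estimated coefficients. It also compares the two AIC values reported for the reduced model (966.4646 and 447.1331). Their gap matches n(1 + ln 2π), which would be consistent with one routine dropping the constant part of the Gaussian log-likelihood; that explanation is an assumption drawn from the numbers, not something the output states.

```python
import math

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (excluding the intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n = 183  # residual df + number of estimated coefficients (173 + 10)

reduced = adjusted_r2(0.86, n, p=9)     # summary reports 0.8527
full = adjusted_r2(0.8636, n, p=20)     # summary reports 0.8468
print(round(reduced, 4), round(full, 4))

# Gap between the two AIC values reported for the reduced model.
aic_gap = 966.4646 - 447.1331296
constant = n * (1.0 + math.log(2.0 * math.pi))
print(round(aic_gap, 3), round(constant, 3))
```

The reduced model has a slightly lower R^2 than the full model but a higher adjusted R^2, because the adjustment penalizes the eleven extra predictors in the full model.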
We assume that the collected data is not accurate for the percentage expenditure column. If percentage expenditure is defined as expenditure on health as a percentage of Gross Domestic Product per capita (%), then it can never be more than 100%. Also, it is highly unlikely that any country would spend 100% of its GDP on healthcare.
We therefore ignore the percentage expenditure column in the analysis and use only Total expenditure to find out whether it has an impact on life expectancy.
From the EDA and a two-sample t-test, we see that mean healthcare expenditure does not differ significantly between the high and low life expectancy groups.
t = 1.9583, df = 181, p-value = 0.05173
95 percent confidence interval: -0.00710969, 1.88675914. Since 0 is one of the plausible values for the difference in means, the difference in Total expenditure between countries with life expectancy greater than 65 and less than 65 is not statistically significant at the 5% level.
##
## Two Sample t-test
##
## data: Total.expenditure by Life.expectancy.category
## t = 1.9583, df = 181, p-value = 0.05173
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.00710969 1.88675914
## sample estimates:
## mean in group High mean in group Low
## 6.411556 5.471732
##
## Classification tree:
## tree(formula = Life.expectancy.category ~ Total.expenditure,
## data = df1_complete)
## Number of terminal nodes: 6
## Residual mean deviance: 0.9451 = 167.3 / 177
## Misclassification error rate: 0.1858 = 34 / 183
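The interval in the two-sample t-test output above can be cross-checked against the other numbers it reports. The Python sketch below recovers the standard error from the difference in group means and the t statistic, then confirms that the interval is centered on that difference. The critical value is derived from the reported interval rather than looked up, so this is a sanity check and not a re-run of the test.

```python
# Values copied from the t-test output.
mean_high, mean_low = 6.411556, 5.471732
t_stat = 1.9583
ci_low, ci_high = -0.00710969, 1.88675914

diff = mean_high - mean_low            # estimated difference in means
std_err = diff / t_stat                # standard error implied by t = diff / SE
center = (ci_low + ci_high) / 2.0      # midpoint of the reported interval
half_width = (ci_high - ci_low) / 2.0
t_crit = half_width / std_err          # critical value implied by the interval

print(round(diff, 4), round(center, 4), round(t_crit, 3))
# The interval contains 0 exactly when |t_stat| is below the critical value.
print(ci_low < 0.0 < ci_high, abs(t_stat) < t_crit)
```

The t statistic falls just short of the implied critical value, which is why the p-value (0.05173) sits only slightly above 0.05 and the interval barely includes zero.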
From the EDA, the effect of infant deaths on life expectancy does not look significant. However, Adult.Mortality is negatively correlated with life expectancy.
The relationship between Adult.Mortality and life expectancy can be modeled by the regression equation below:
life expectancy = 80.64428 - 0.06125 (Adult Mortality)
We notice that as Adult.Mortality increases by one unit, life expectancy is expected to decrease by 0.06125 years. With an Adult.Mortality of zero, life expectancy is expected to be 80.64428 years. Adjusted R^2: 0.5732
When Adult.Mortality and infant deaths were used together, the model did not meet the linear regression assumptions.
##
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9654 -2.5457 0.8639 3.2843 13.1335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.64428 0.71342 113.04 <2e-16 ***
## Adult.Mortality -0.06125 0.00391 -15.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.593 on 181 degrees of freedom
## Multiple R-squared: 0.5755, Adjusted R-squared: 0.5732
## F-statistic: 245.4 on 1 and 181 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + infant.deaths,
## data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.4677 -2.6017 0.6429 3.1007 12.9854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.675651 0.706482 114.193 <2e-16 ***
## Adult.Mortality -0.059758 0.003933 -15.194 <2e-16 ***
## infant.deaths -0.010334 0.004791 -2.157 0.0323 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.537 on 180 degrees of freedom
## Multiple R-squared: 0.5862, Adjusted R-squared: 0.5816
## F-statistic: 127.5 on 2 and 180 DF, p-value: < 2.2e-16
We assume that BMI is related to lifestyle, eating habits, and exercise. Since we see a positive correlation between BMI and life expectancy, we assume that good eating habits, ample exercise, and a healthy lifestyle are associated with higher life expectancy.
The model summary shows that BMI is highly significant (p < 0.001). The relationship can be modeled by the regression equation below:
life expectancy = 63.57 + 0.19423 (BMI)
We notice that as BMI increases by one unit, life expectancy is expected to increase by 0.19423 years. With a BMI of zero, life expectancy is expected to be 63.57 years (which practically does not make sense). Adjusted R^2: 0.2226
Note: The effect of alcohol is covered in question 6.
##
## Call:
## lm(formula = Life.expectancy ~ BMI, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.0898 -4.6841 0.3523 4.3841 24.0927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.56713 1.22767 51.779 < 2e-16 ***
## BMI 0.19423 0.02665 7.288 9.42e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.548 on 181 degrees of freedom
## Multiple R-squared: 0.2269, Adjusted R-squared: 0.2226
## F-statistic: 53.11 on 1 and 181 DF, p-value: 9.42e-12
Life expectancy and schooling have a positive linear relationship. Despite not being significant in the full model, schooling is a significant predictor when it is modeled in a simple linear regression.
The model summary shows that schooling is highly significant (p < 0.001). The relationship can be modeled by the regression equation below:
life expectancy = 41.42 + 2.34 (schooling)
We notice that as schooling increases by a year, life expectancy increases by 2.34 years, with no schooling corresponding to a life expectancy of 41.42 years. Adjusted R^2: 0.5949
##
## Call:
## lm(formula = Life.expectancy ~ Schooling, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7366 -2.9955 0.4148 3.7260 11.2971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.4171 1.8826 22.00 <2e-16 ***
## Schooling 2.3371 0.1427 16.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.449 on 181 degrees of freedom
## Multiple R-squared: 0.5971, Adjusted R-squared: 0.5949
## F-statistic: 268.2 on 1 and 181 DF, p-value: < 2.2e-16
Life expectancy shows a slight positive increase with alcohol consumption.
life expectancy = 68.03 + 1.07 (Alcohol)
Notice that as alcohol consumption increases by one unit, life expectancy increases by 1.07 years, starting at 68.03 years if no alcohol is consumed.
This is somewhat counterintuitive, given that alcohol is not considered healthy and many studies suggest that it could shorten life spans. With this in mind, additional studies may need to be conducted.
In addition, the model does not meet the required assumptions for linear regression. Specifically, the model lacks constant variance, so its estimates should not be relied upon.
##
## Call:
## lm(formula = Life.expectancy ~ Alcohol, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.2658 -4.4254 0.7636 5.5977 15.4636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.0257 0.6916 98.362 < 2e-16 ***
## Alcohol 1.0732 0.1312 8.179 4.89e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.335 on 181 degrees of freedom
## Multiple R-squared: 0.2699, Adjusted R-squared: 0.2658
## F-statistic: 66.9 on 1 and 181 DF, p-value: 4.887e-14
Initially, there is a significant outlier that skews the data dramatically. Once it was removed, the slope changed from -3.475e-9 to -1.128e-8. Considering that some countries can legitimately have very high populations, we decided to keep the data point.
life expectancy = 71.61 - 3.475e-9 (Population)
Notice that as the population of a country increases by one person, life expectancy decreases by 3.475e-9 years (about 3.5 years per billion people), starting at 71.61 years if there is no population. The slope is not statistically significant (p = 0.59). In this scenario, the intercept has no logical interpretation, since no population would mean no life expectancy.
We notice again that the model fails the constant variance assumption; hence, the model cannot be relied upon.
##
## Call:
## lm(formula = Life.expectancy ~ Population, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.507 -5.987 1.991 5.313 17.409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.161e+01 6.484e-01 110.44 <2e-16 ***
## Population -3.475e-09 6.440e-09 -0.54 0.59
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.578 on 181 degrees of freedom
## Multiple R-squared: 0.001606, Adjusted R-squared: -0.00391
## F-statistic: 0.2911 on 1 and 181 DF, p-value: 0.5902
##
## Call:
## lm(formula = Life.expectancy ~ Population, data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.606 -6.029 2.172 5.398 17.347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.171e+01 7.120e-01 100.7 <2e-16 ***
## Population -1.128e-08 2.257e-08 -0.5 0.618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.598 on 180 degrees of freedom
## Multiple R-squared: 0.001387, Adjusted R-squared: -0.004161
## F-statistic: 0.25 on 1 and 180 DF, p-value: 0.6177
Reviewing the graphs, there does not appear to be a significant relationship between life expectancy and immunization.
##
## Call:
## lm(formula = Life.expectancy ~ ., data = df_immunizations)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.1606 -5.6656 0.6302 4.6017 24.9821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.68785 2.54659 22.260 < 2e-16 ***
## Hepatitis.B -0.01012 0.04671 -0.217 0.82871
## Polio 0.11826 0.04232 2.794 0.00577 **
## Diphtheria 0.06744 0.05293 1.274 0.20425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.83 on 179 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1634
## F-statistic: 12.85 on 3 and 179 DF, p-value: 1.212e-07
##
## Call:
## lm(formula = log.Life.expectancy ~ log.Hepatitis.B, data = df_immunizations)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39440 -0.07385 0.02842 0.07380 0.23821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.05582 0.06302 64.358 <2e-16 ***
## log.Hepatitis.B 0.04795 0.01446 3.317 0.0011 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1211 on 181 degrees of freedom
## Multiple R-squared: 0.0573, Adjusted R-squared: 0.05209
## F-statistic: 11 on 1 and 181 DF, p-value: 0.0011
##
## Call:
## lm(formula = Life.expectancy ~ Hepatitis.B + I(Hepatitis.B^2),
## data = df_immunizations)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.6631 -5.3127 0.0933 4.5659 19.2030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.8849360 2.9985672 22.973 < 2e-16 ***
## Hepatitis.B -0.2710250 0.1123556 -2.412 0.016861 *
## I(Hepatitis.B^2) 0.0033928 0.0009535 3.558 0.000477 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.902 on 180 degrees of freedom
## Multiple R-squared: 0.1574, Adjusted R-squared: 0.148
## F-statistic: 16.81 on 2 and 180 DF, p-value: 2.027e-07
## [1] 1280.866
## [1] 1293.704
Display the ability to build regression models using the skills and discussions from Unit 1 and 2 with the purpose of identifying key relationships, interpreting those relationships, and making good predictions.
Reminder, key here is to tell a good story.
Unit 2 Objectives:
- bias vs. variance
- complexity
- LASSO/LARS, CV (cross validation)
LASSO
Using the lasso regression method, Income.composition.of.resources was determined to be the significant predictor. While other variables were retained in the model (Total.expenditure, HIV.AIDS, and BMI), their coefficients are extremely small and are not considered significant.
Life expectancy = 37.72 + 49.11 (income)
R^2: 0.7317
CV: 9.896252
Notice that R^2 drops compared with the multiple linear regression model, but the model is significantly less complex.
## [1] 9.782965
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 62.1431290847
## StatusDeveloping -1.9342221405
## Adult.Mortality -0.0188318537
## infant.deaths .
## Alcohol 0.0024896480
## percentage.expenditure .
## Hepatitis.B .
## Measles .
## BMI .
## under.five.deaths -0.0006208113
## Polio .
## Total.expenditure 0.4010933313
## Diphtheria 0.0031529074
## HIV.AIDS .
## GDP .
## Population .
## thinness..1.19.years .
## thinness.5.9.years -0.1168038289
## Income.composition.of.resources 18.5336459933
## Schooling .
## Life.expectancy.categoryLow -7.2540778170
## RMSE Rsquare
## 1 3.127773 0.8540002
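The dots in the coefficient listing above mark coefficients that the lasso shrank to exactly zero. A minimal Python sketch of why this happens: for an orthonormal design, the lasso estimate is the ordinary least squares estimate passed through a soft-thresholding operator. The coefficient values and the penalty `lam` below are hypothetical and are not taken from the fitted model.

```python
def soft_threshold(beta, lam):
    """Shrink beta toward zero by lam; return 0 if |beta| <= lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# Hypothetical least squares coefficients on standardized predictors.
ols = {"income": 4.2, "adult_mortality": -1.5, "polio": 0.08, "measles": -0.03}
lam = 0.1  # hypothetical penalty strength

lasso = {name: soft_threshold(b, lam) for name, b in ols.items()}
kept = [name for name, b in lasso.items() if b != 0.0]
print(lasso)
print("retained:", kept)
```

Larger penalties zero out more coefficients, which is the reduction in complexity that the lasso trades against R^2.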
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources,
## data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7366 -2.0664 0.0655 2.3525 10.4634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.723 1.551 24.32 <2e-16 ***
## Income.composition.of.resources 49.119 2.202 22.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.434 on 181 degrees of freedom
## Multiple R-squared: 0.7332, Adjusted R-squared: 0.7317
## F-statistic: 497.4 on 1 and 181 DF, p-value: < 2.2e-16
2. Report predictive ability
a. Test/train set
b. CV data
## 2.5 % 97.5 %
## (Intercept) 33.81911 40.36032
## Income.composition.of.resources 45.63222 54.82424
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources,
## data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3665 -1.9079 0.1418 2.1354 10.3335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.090 1.655 22.41 <2e-16 ***
## Income.composition.of.resources 50.228 2.325 21.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.237 on 144 degrees of freedom
## Multiple R-squared: 0.7642, Adjusted R-squared: 0.7625
## F-statistic: 466.6 on 1 and 144 DF, p-value: < 2.2e-16
## [1] 839.9218
## [1] 848.8726
## CV AIC AICc BIC AdjR2
## 18.1672688 425.5917578 425.7607719 434.5425777 0.7625353
## fit lwr upr
## 6 78.52801 77.61430 79.44173
## 17 72.50062 71.80614 73.19511
## 22 72.09880 71.40566 72.79193
## 28 71.66649 70.97263 72.36035
## 29 69.38647 68.65264 70.12031
## 34 56.67873 55.11308 58.24438
## 52 71.04401 70.34516 71.74285
## 53 66.32255 65.45521 67.18990
## 56 58.93900 57.55577 60.32223
## 62 75.21295 74.46083 75.96507
## 67 67.92986 67.14211 68.71760
## 69 58.13535 56.68801 59.58269
## 79 82.79741 81.59214 84.00268
## 82 73.60565 72.89755 74.31374
## 84 74.10793 73.38901 74.82685
## 89 70.03944 69.32301 70.75587
## 90 65.87050 64.97767 66.76332
## 92 75.41386 74.65442 76.17331
## 96 79.33166 78.36842 80.29491
## 101 71.89789 71.20470 72.59108
## 113 68.83396 68.08200 69.58592
## 114 64.76548 63.80570 65.72526
## 117 69.03488 68.28986 69.77990
## 118 54.41846 52.66497 56.17195
## 119 63.25863 62.19879 64.31847
## 120 84.55540 83.21533 85.89547
## 124 62.75635 61.66132 63.85137
## 125 71.64674 70.95280 72.34069
## 143 75.56455 74.79936 76.32973
## 153 75.26318 74.50926 76.01709
## 155 73.35450 72.65075 74.05826
## 160 68.33168 67.56073 69.10263
## 172 78.87961 77.94462 79.81461
## 176 76.82025 75.99976 77.64075
## 177 71.74720 71.05365 72.44075
## 178 67.02575 66.19549 67.85600
## 183 62.10338 60.96147 63.24529
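The train/test evaluation above can be sketched as follows. This Python illustration uses synthetic data and a hypothetical 80/20 split (the 146 training rows and 37 predicted rows in the output suggest a similar proportion). `fit_simple_ols` and `rmse` are names introduced here for the example, not functions from the original analysis.

```python
import math
import random

def fit_simple_ols(xs, ys):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rmse(actual, predicted):
    """Root mean squared prediction error."""
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))

# Synthetic data: income index in [0.30, 0.95], life expectancy with small noise.
rng = random.Random(42)
xs = [i / 100 for i in range(30, 96)]
ys = [37.0 + 50.0 * x + rng.uniform(-1.0, 1.0) for x in xs]

# Shuffle the row indices, then hold out the last 20% as the test set.
idx = list(range(len(xs)))
rng.shuffle(idx)
cut = int(0.8 * len(idx))
train, test = idx[:cut], idx[cut:]

intercept, slope = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
preds = [intercept + slope * xs[i] for i in test]
test_rmse = rmse([ys[i] for i in test], preds)
print(len(train), len(test), round(slope, 2), round(test_rmse, 2))
```

Because the test rows play no part in the fit, the test RMSE is an estimate of out-of-sample error rather than of in-sample fit.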
Comparing the models
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources,
## data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7366 -2.0664 0.0655 2.3525 10.4634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.723 1.551 24.32 <2e-16 ***
## Income.composition.of.resources 49.119 2.202 22.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.434 on 181 degrees of freedom
## Multiple R-squared: 0.7332, Adjusted R-squared: 0.7317
## F-statistic: 497.4 on 1 and 181 DF, p-value: < 2.2e-16
Interpret the coefficients
Life expectancy has a linear relationship with income composition of resources: as the income index increases by one unit, life expectancy increases by 50.2 years. It should be noted that the income index ranges from 0 to 1, so a 0.1 increase corresponds to about 5 more years and the maximum predicted life expectancy is 87.3 years. This is not fully realistic considering that many people live past this age. In addition, if there is no income or composition of resources, the expected life expectancy is 37.1 years. While this also may not be applicable, it could pertain to those who are unemployed without any income source.
Confidence intervals
## 2.5 % 97.5 %
## (Intercept) 33.81911 40.36032
## Income.composition.of.resources 45.63222 54.82424
- Produce the best predictions possible
- Interpretation is no longer required, hence complexity is no longer an issue
A. Linear Regression
- model a: linear regression
  life expectancy = 36.55 + 50.73 (income)
  Adjusted R^2 = 0.79
- model b: linear regression + adult mortality
  life expectancy = 48.5 + 38.77 (income) - 0.025 (adult mortality)
  Adjusted R^2 = 0.84
- model c: linear regression + adult mortality + HIV.AIDS
  life expectancy = 49.8 + 36.05 (income) - 0.016 (adult mortality) - 0.95 (HIV/AIDS)
  Adjusted R^2 = 0.85
B. Interaction Terms
- model d: linear regression + adult mortality + HIV.AIDS
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources,
## data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7366 -2.0664 0.0655 2.3525 10.4634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.723 1.551 24.32 <2e-16 ***
## Income.composition.of.resources 49.119 2.202 22.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.434 on 181 degrees of freedom
## Multiple R-squared: 0.7332, Adjusted R-squared: 0.7317
## F-statistic: 497.4 on 1 and 181 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources +
## Adult.Mortality, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.3882 -1.6648 0.1117 1.9530 10.2110
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.357043 2.307517 21.823 < 2e-16 ***
## Income.composition.of.resources 36.398916 2.705873 13.452 < 2e-16 ***
## Adult.Mortality -0.026076 0.003809 -6.847 1.15e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.961 on 180 degrees of freedom
## Multiple R-squared: 0.7883, Adjusted R-squared: 0.786
## F-statistic: 335.2 on 2 and 180 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources +
## Adult.Mortality + Total.expenditure, data = df.expenditure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.4690 -1.6843 0.4308 2.5825 8.6144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.874569 3.752761 13.024 < 2e-16 ***
## Income.composition.of.resources 34.237635 4.879284 7.017 5.66e-10 ***
## Adult.Mortality -0.022318 0.005845 -3.818 0.000258 ***
## Total.expenditure 0.327685 0.185466 1.767 0.080934 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.718 on 83 degrees of freedom
## Multiple R-squared: 0.6884, Adjusted R-squared: 0.6771
## F-statistic: 61.12 on 3 and 83 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources +
## Adult.Mortality + HIV.AIDS, data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.206 -1.834 -0.020 2.010 10.284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.554312 2.216126 22.812 < 2e-16 ***
## Income.composition.of.resources 35.477611 2.608106 13.603 < 2e-16 ***
## Adult.Mortality -0.018297 0.004135 -4.425 1.67e-05 ***
## HIV.AIDS -1.055254 0.261798 -4.031 8.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 179 degrees of freedom
## Multiple R-squared: 0.8059, Adjusted R-squared: 0.8027
## F-statistic: 247.8 on 3 and 179 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources +
## Adult.Mortality + HIV.AIDS + (Adult.Mortality * HIV.AIDS) +
## (Income.composition.of.resources * Adult.Mortality) + (Income.composition.of.resources *
## HIV.AIDS), data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3664 -1.8844 -0.1826 1.8580 9.9718
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 52.998853 3.048212 17.387
## Income.composition.of.resources 33.733828 3.890978 8.670
## Adult.Mortality -0.032444 0.015087 -2.151
## HIV.AIDS 2.045694 1.958163 1.045
## Adult.Mortality:HIV.AIDS 0.004782 0.001708 2.799
## Income.composition.of.resources:Adult.Mortality 0.012899 0.023369 0.552
## Income.composition.of.resources:HIV.AIDS -9.070358 3.124681 -2.903
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Income.composition.of.resources 2.78e-15 ***
## Adult.Mortality 0.03288 *
## HIV.AIDS 0.29759
## Adult.Mortality:HIV.AIDS 0.00569 **
## Income.composition.of.resources:Adult.Mortality 0.58168
## Income.composition.of.resources:HIV.AIDS 0.00417 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.533 on 176 degrees of freedom
## Multiple R-squared: 0.8353, Adjusted R-squared: 0.8297
## F-statistic: 148.7 on 6 and 176 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources +
## Adult.Mortality + (Income.composition.of.resources * HIV.AIDS),
## data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2720 -1.8066 -0.1358 1.5006 9.7367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.328934 2.339859 20.655 < 2e-16
## Income.composition.of.resources 39.461755 2.795040 14.118 < 2e-16
## Adult.Mortality -0.022092 0.004253 -5.195 7.03e-07
## HIV.AIDS 4.341832 1.437251 3.021 0.002993
## Income.composition.of.resources:HIV.AIDS -9.631094 2.763085 -3.486 0.000655
##
## (Intercept) ***
## Income.composition.of.resources ***
## Adult.Mortality ***
## HIV.AIDS **
## Income.composition.of.resources:HIV.AIDS ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.504 on 141 degrees of freedom
## Multiple R-squared: 0.8421, Adjusted R-squared: 0.8376
## F-statistic: 187.9 on 4 and 141 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 43.70319355 52.95467477
## Income.composition.of.resources 33.93615217 44.98735854
## Adult.Mortality -0.03049973 -0.01368488
## HIV.AIDS 1.50048435 7.18317931
## Income.composition.of.resources:HIV.AIDS -15.09352371 -4.16866375
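Because the model includes an interaction, the HIV.AIDS coefficient cannot be read on its own: the estimated change in life expectancy per unit of HIV.AIDS is the main effect plus the interaction coefficient times the income index. The Python sketch below evaluates this using the coefficients from the training-data summary above. The income values 0.3 and 0.8 are arbitrary illustration points.

```python
# Coefficients copied from the interaction model fitted on trainingData.
b_hiv = 4.341832            # HIV.AIDS main effect
b_interaction = -9.631094   # Income.composition.of.resources:HIV.AIDS

def hiv_marginal_effect(income):
    """Estimated change in life expectancy per unit of HIV.AIDS at a given income index."""
    return b_hiv + b_interaction * income

# Income index at which the estimated HIV.AIDS effect changes sign.
crossover = -b_hiv / b_interaction

for income in (0.3, 0.8):
    print(income, round(hiv_marginal_effect(income), 3))
print("sign change at income index", round(crossover, 3))
```

Under this fit, the estimated effect of HIV.AIDS is negative only for income index values above roughly 0.45, so the positive main-effect coefficient should not be read as HIV/AIDS raising life expectancy.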
## Models CV AIC BIC AdjR2 Accuracy
## 1 model1 19.84151 549.0920 558.7205 0.7317107 0.76
## 2 model2 12.74281 373.0652 390.9669 0.8375767 0.85
We notice that model 2 (the more complex of the two models) has a lower CV PRESS and a higher adjusted R^2. While model 1 is simpler to comprehend, model 2 has better predictive power.
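The CV column in the comparison is a leave-one-out style criterion. As a hedged illustration of how such a statistic can be computed without refitting the model n times, the Python sketch below uses the standard least squares identity that the leave-one-out residual equals e_i / (1 - h_ii), where h_ii is the leverage of observation i. It uses a single predictor and synthetic data; whether the CV figures reported here were produced exactly this way is an assumption.

```python
def fit(xs, ys):
    """Simple least squares fit; returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope

def loocv_shortcut(xs, ys):
    """Mean squared leave-one-out error via leverages, using a single fit."""
    n = len(xs)
    a, b = fit(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx
        total += ((y - (a + b * x)) / (1.0 - leverage)) ** 2
    return total / n

def loocv_refit(xs, ys):
    """Same quantity computed the slow way: refit with each point held out."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / len(xs)

# Synthetic data for illustration only.
xs = [0.35, 0.42, 0.50, 0.58, 0.66, 0.71, 0.79, 0.88]
ys = [55.1, 58.9, 61.7, 67.2, 70.4, 72.0, 77.3, 81.5]
print(round(loocv_shortcut(xs, ys), 6), round(loocv_refit(xs, ys), 6))
```

The two functions return the same value, so the shortcut gives an out-of-sample error estimate at the cost of one fit; a lower value indicates better expected predictive performance.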
- Nonparametric technique
- kNN or regression trees (select one)
##
## Call:
## randomForest(formula = Life.expectancy ~ ., data = trainingData)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 7.513744
## % Var explained: 89.99
##
## Call:
## randomForest(formula = Life.expectancy ~ ., data = new.train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 9.186914
## % Var explained: 87.76
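The two random forest summaries can be cross-checked against each other. If the percentage of variance explained is taken as 1 - MSE / Var(y) (an assumption about how the statistic is defined, based on its name), then each summary implies a value for the variance of life expectancy in the training data, and the two implied values should agree. A short Python check:

```python
def implied_variance(mse, pct_var_explained):
    """Variance of the response implied by MSE and % variance explained."""
    return mse / (1.0 - pct_var_explained / 100.0)

var_first = implied_variance(7.513744, 89.99)    # forest fitted on trainingData
var_second = implied_variance(9.186914, 87.76)   # forest fitted on new.train
print(round(var_first, 2), round(var_second, 2))
```

Both summaries imply nearly the same response variance, which suggests the two forests were trained on the same rows; if so, the drop from 89.99% to 87.76% reflects the change in predictors rather than a change in data.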